fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status #9754

Merged

mudler merged 2 commits into master from fix/stale-node-models-cleanup on May 13, 2026

Conversation

@localai-bot
Collaborator

Summary

Stale node_models rows (state="loaded") were outliving the healthy state of their owning node, causing /embeddings (and other inference paths) to dispatch to a backend whose process was gone or drained. The downstream symptom was pgvector rejecting inserts with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)" because the misbehaving backend silently returned a malformed (oversized) tensor; the Models page showed the model as "running" with no associated node (a stale entry), even though the node was no longer visible in the Nodes view.

Why

Repro path observed live:

  • qwen3-embedding-4b was registered on a worker that had been operationally drained (process gone, but node-level heartbeats still arriving).
  • FindAndLockNodeWithModel picked the stale node_models row because the JOIN only filtered by node_models.state = "loaded"; the status check happened in a second query whose failure rolled back the transaction but didn't prevent the same row from being picked next time (see the query sketch after this list).
  • probeHealth (gRPC ping) passed against the drained worker.
  • The backend returned an oversized tensor → /embeddings returned 200 → LocalRecall's pgvector insert blew up at the 16000-dim ceiling.
  • MarkUnhealthy / MarkDraining only flipped backend_nodes.status; unlike MarkOffline, they didn't cascade-clean node_models. Stale rows accumulated.
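
For illustration, the pre-fix shape of that lookup, as a minimal sketch: the table and column names follow the text above, but the concrete schema, SQL, and Postgres-style store are assumptions, not the actual code.

```go
package nodes // hypothetical package name, for illustration only

import "database/sql"

// findLoadedReplica sketches the pre-fix selection: the filter only looks at
// node_models.state, so a row owned by a drained or unhealthy node can still
// be returned; the node-status check happens in a later query, after the row
// has already been picked and locked.
func findLoadedReplica(tx *sql.Tx, modelID string) (nodeID string, err error) {
	err = tx.QueryRow(`
		SELECT nm.node_id
		  FROM node_models nm
		 WHERE nm.model_id = $1
		   AND nm.state = 'loaded'
		 LIMIT 1
		   FOR UPDATE SKIP LOCKED`, modelID).Scan(&nodeID)
	return nodeID, err
}
```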

Changes (defense-in-depth)

  • MarkUnhealthy / MarkDraining cascade-delete node_models — mirrors MarkOffline. On the unhealthy path the backend is presumed unreachable; on drain, routing already filters these out, but clearing the rows stops the UI from misreporting and prevents future scheduling-logic relaxations from picking them up.
  • FindAndLockNodeWithModel inner JOIN now also filters by backend_nodes.status = healthy — the previous version relied on the second node-fetch step to reject non-healthy nodes, but a concurrent reader could still pick the same stale row in that window. Belt-and-braces (see the sketch after this list).
  • DistributedConfig.PerModelHealthCheck renamed to DisablePerModelHealthCheck (inverted; probing on by default) — the per-model gRPC probe independently health-checks each model's gRPC address and removes stale node_models rows when the backend has crashed even though the worker's node-level heartbeat is still arriving.
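
A sketch of the hardened lookup from the second bullet, with the status filter folded into the same locked query (same caveats as the sketch above: schema, SQL, and signature are illustrative assumptions):

```go
package nodes // hypothetical package name, for illustration only

import "database/sql"

// findHealthyLoadedReplica folds the node-status filter into the same locked
// query, so a concurrent reader can no longer pick a node_models row whose
// owning node has stopped being healthy.
func findHealthyLoadedReplica(tx *sql.Tx, modelID string) (nodeID string, err error) {
	err = tx.QueryRow(`
		SELECT nm.node_id
		  FROM node_models nm
		  JOIN backend_nodes bn ON bn.id = nm.node_id
		 WHERE nm.model_id = $1
		   AND nm.state  = 'loaded'
		   AND bn.status = 'healthy'
		 LIMIT 1
		   FOR UPDATE OF nm SKIP LOCKED`, modelID).Scan(&nodeID)
	return nodeID, err
}
```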

Migration

DistributedConfig.PerModelHealthCheck was renamed and inverted to DisablePerModelHealthCheck. The field had no CLI flag, env var binding, or YAML key in tree (only the bare struct field), so there's no user-facing migration. Anything constructing DistributedConfig in code needs to drop the assignment (default now does the right thing) or invert it.
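
Roughly, for code constructing the struct (the field name comes from this PR; the surrounding struct and helper are illustrative only):

```go
package config // hypothetical package name, for illustration only

// Sketch only: the zero value of the renamed field means "probing enabled",
// so code that previously set PerModelHealthCheck: true can simply drop the
// assignment.
type DistributedConfig struct {
	DisablePerModelHealthCheck bool // was PerModelHealthCheck; meaning inverted
	// ... other fields elided
}

func perModelProbingEnabled(cfg DistributedConfig) bool {
	return !cfg.DisablePerModelHealthCheck // inverted at the call site
}
```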

Test plan

  • go test ./core/services/nodes/ (100s, all pass)
  • go vet ./core/services/nodes/ ./core/config/ ./core/application/ (clean)
  • Live verification in the dev cluster where the original symptom was observed: drain a backend hosting an embedding model, confirm /embeddings either returns an error (no replica) or routes to a healthy one — no 200-with-garbage, no pgvector dimension overflow downstream.

🤖 Generated with Claude Code

mudler added 2 commits May 11, 2026 08:24
… routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.

Three changes here, plus a hardening change in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain. (A sketch
  of this cascade follows, after this list.)

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.
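
For reference, a sketch of the drain-path cascade from the first bullet
above (store API, schema, and SQL are assumptions and may not match the
real persistence layer):

```go
package nodes // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
)

// markDraining flips the node's status and, mirroring MarkOffline, removes
// its node_models rows in the same transaction, so neither the Models UI nor
// the router keeps seeing replicas that can no longer serve traffic.
func markDraining(ctx context.Context, db *sql.DB, nodeID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	if _, err := tx.ExecContext(ctx,
		`UPDATE backend_nodes SET status = 'draining' WHERE id = $1`, nodeID); err != nil {
		return err
	}
	// Cascade-clean the replicas owned by this node.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM node_models WHERE node_id = $1`, nodeID); err != nil {
		return err
	}
	return tx.Commit()
}
```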

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…emoves a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.
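
A compact sketch of that bookkeeping (the threshold value and the
(node, model, replica) key come from this message; the type and method
names are made up for illustration):

```go
package nodes // hypothetical package name, for illustration only

import "sync"

const perModelMissThreshold = 3 // consecutive failed probes before removal

// replicaKey identifies one probed replica: (node, model, replica).
type replicaKey struct{ node, model, replica string }

type healthMonitor struct {
	mu     sync.Mutex
	misses map[replicaKey]int
}

// newHealthMonitor initializes the misses map so direct construction in tests
// cannot nil-map-panic in the probe path.
func newHealthMonitor() *healthMonitor {
	return &healthMonitor{misses: map[replicaKey]int{}}
}

// recordProbe updates the consecutive-miss streak for one replica and reports
// whether the caller should remove its node_models row on this tick. A single
// success resets the streak; the counter is only cleared via clearMisses once
// the removal itself has succeeded, so a failed removal retries next tick
// instead of starting the streak over.
func (m *healthMonitor) recordProbe(k replicaKey, probeOK bool) (remove bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if probeOK {
		delete(m.misses, k)
		return false
	}
	m.misses[k]++
	return m.misses[k] >= perModelMissThreshold
}

// clearMisses resets the streak after a successful removal.
func (m *healthMonitor) clearMisses(k replicaKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.misses, k)
}
```

The probe loop would call recordProbe on every tick and clearMisses only
after the removal call has succeeded.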

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler force-pushed the fix/stale-node-models-cleanup branch from 78fb08b to 9029ec2 on May 11, 2026 08:37
mudler merged commit b4fdb41 into master on May 13, 2026
54 checks passed
mudler deleted the fix/stale-node-models-cleanup branch on May 13, 2026 19:57
mudler added the bug label on May 13, 2026